Skip to content

feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780

Open
WZhuo wants to merge 2 commits into
apache:mainfrom
WZhuo:arrow-row-builder
Open

feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780
WZhuo wants to merge 2 commits into
apache:mainfrom
WZhuo:arrow-row-builder

Conversation

@WZhuo

@WZhuo WZhuo commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds ArrowRowBuilder (arrow_row_builder_internal.h / arrow_row_builder.cc), a schema-driven RAII helper that materializes in-memory rows into an Arrow ArrowArray (a struct batch) for an arbitrary Iceberg schema. It wraps the nanoarrow boilerplate and exposes per-column access plus typed append free functions, so metadata tables (snapshots, history, manifests, …) can emit rows without re-implementing it.

This is the first of a series splitting metadata-table support into focused PRs; the InMemoryBatchReader and the SnapshotsTable::Scan integration are intended to follow in separate PRs that build on this.

What's included

  • ArrowRowBuilder — a single RAII class (move-only) with Make(const Schema&) and Make(const ArrowSchema*) overloads. Handles the full nanoarrow lifecycle: InitFromSchemaStartAppending → … append values … → FinishBuildingRelease. The ArrowArray is guarded immediately after InitFromSchema so a failure in StartAppending releases it automatically.
  • ArrowArrayGuard::Release() — added to the existing guard so other call sites (position_delete_writer, manifest_adapter) can reuse the RAII-release pattern instead of manually managing nanoarrow resources.
  • Free functions in the iceberg namespace: AppendNull, AppendBoolean, AppendInt (covers int32/int64/timestamp via nanoarrow's int64), AppendString, AppendStringMap.
  • The implementation lives at the core iceberg library level — it only needs nanoarrow + ToArrowSchema (no Apache Arrow), matching peers like manifest_adapter and arrow_c_data_util.
  • Unit tests in arrow_row_builder_test.cc covering typed appends (int32/string/int64/boolean/map), null handling for optional columns, multi-entry/empty string maps, zero-row batches, and column-index bounds. Compiled into the iceberg-data-test test target.

Testing

  • CMake (Ninja): cmake --build build --target iceberg-data-test then ran the test binary — 5/5 ArrowRowBuilderTest tests pass. ctest green.
  • The test verifies output by importing the produced C-data into Apache Arrow (arrow::ImportRecordBatch), so its target is under USE_BUNDLE.

Notes

  • The test is registered under CMake's bundle build only. The meson build (which has no Apache Arrow/bundle layer) is left unchanged; the core-only test target continues to build there.
  • Developed with AI-assisted tooling, reviewed by the author.

Add ArrowRowBuilder (inspect/row_builder_internal) to materialize
in-memory rows into an ArrowArray for an arbitrary Iceberg schema,
with typed append helpers (AppendNull/Boolean/Int/String/StringMap)
reused by later metadata tables.
@WZhuo WZhuo force-pushed the arrow-row-builder branch from b0a8615 to b42f0da Compare June 25, 2026 03:38
Comment thread src/iceberg/CMakeLists.txt Outdated
Comment thread src/iceberg/inspect/row_builder_internal.h Outdated
Comment thread src/iceberg/inspect/row_builder_internal.cc Outdated
@wgtmac

wgtmac commented Jun 25, 2026

Copy link
Copy Markdown
Member

I think this is a good point to factor out a small internal nanoarrow builder/helper layer instead of adding another local copy of the same lifecycle code.

We already have the same pattern in a few production paths: initialize from schema, start appending, append child values, finish each row/list/map element, then finish building. For example, position_delete_writer.cc and arrow_c_data_util.cc both guard the initialized ArrowArray immediately after ArrowArrayInitFromSchema, while this new builder currently misses that and can leak if ArrowArrayStartAppending fails. manifest_adapter.cc also has local primitive/list/map append helpers, including map-shape validation, and this PR adds another set of primitive/string-map helpers.

A concrete way to keep this scoped would be:

  1. Add an internal move-only nanoarrow array builder/owner that handles init/start/finish/release/ownership transfer.
  2. Move the simple append helpers (null, signed/unsigned int, double, string, bytes, and maybe map entry validation) behind that internal helper.
  3. Keep higher-level code like ManifestAdapter, batch projection, and metadata-table row materialization separate, since those have different domain semantics.

That should fix the leak path here and avoid repeating this fragile nanoarrow lifecycle in the next metadata-table PRs without turning this into a broad refactor.

WZhuo added a commit to WZhuo/iceberg-cpp that referenced this pull request Jun 29, 2026
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix)
- Move files from inspect/ to core level; rename to arrow_row_builder*
- Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class
- Add Make(const ArrowSchema*) overload for low-level callers
- Add ArrowArrayGuard::Release() and guard InitFromSchema in Make()
- Move test to arrow_row_builder_test.cc in data_test target
@WZhuo

WZhuo commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review. All three points addressed:

  1. ArrowRowBuilder is now a single RAII class (merged the two classes), moved to core level, with Make(const Schema&) and Make(const ArrowSchema*) overloads so both metadata tables and lower-level code like position_delete_writer/manifest_adapter can use it.

  2. Append* helpers are free functions alongside ArrowRowBuilder: null, int (int32/int64/timestamp), boolean, string, and string_map.

  3. The lifecycle is properly guarded: ArrowArrayGuard right after InitFromSchema, Release() after StartAppending succeeds. Also added Release() to ArrowArrayGuard for other call sites to reuse this pattern.

WZhuo added a commit to WZhuo/iceberg-cpp that referenced this pull request Jun 29, 2026
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix)
- Move files from inspect/ to core level; rename to arrow_row_builder*
- Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class
- Add Make(const ArrowSchema*) overload for low-level callers
- Add ArrowArrayGuard::Release() and guard InitFromSchema in Make()
- Move test to arrow_row_builder_test.cc in data_test target
@WZhuo WZhuo force-pushed the arrow-row-builder branch from b82b9ba to f592f6d Compare June 29, 2026 07:54
WZhuo added a commit to WZhuo/iceberg-cpp that referenced this pull request Jun 29, 2026
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix)
- Move files from inspect/ to core level; rename to arrow_row_builder*
- Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class
- Add Make(const ArrowSchema*) overload for low-level callers
- Add ArrowArrayGuard::Release() and guard InitFromSchema in Make()
- Move test to arrow_row_builder_test.cc in data_test target
@WZhuo WZhuo force-pushed the arrow-row-builder branch from f592f6d to d221034 Compare June 29, 2026 07:58
WZhuo added a commit to WZhuo/iceberg-cpp that referenced this pull request Jun 29, 2026
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix)
- Move files from inspect/ to core level; rename to arrow_row_builder*
- Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class
- Add Make(const ArrowSchema*) overload for low-level callers
- Add ArrowArrayGuard::Release() and guard InitFromSchema in Make()
- Move test to arrow_row_builder_test.cc in data_test target
@WZhuo WZhuo force-pushed the arrow-row-builder branch from d221034 to 1212581 Compare June 29, 2026 08:29
- Rename row_builder_internal.cc → arrow_row_builder.cc (no _internal suffix)
- Move files from inspect/ to core level; rename to arrow_row_builder*
- Merge NanoarrowArrayBuilder into ArrowRowBuilder as a single RAII class
- Add Make(const ArrowSchema*) overload for low-level callers
- Add ArrowArrayGuard::Release() and guard InitFromSchema in Make()
- Move test to arrow_row_builder_test.cc in data_test target
@WZhuo WZhuo force-pushed the arrow-row-builder branch from 1212581 to 030c221 Compare June 29, 2026 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants